For each of the distributions below, please include the multi-panel figure generated in your lab report and provide a short answer to the questions that follow. If you write any additional code in deriving your answers (you may not need to in this lab), include it and any R outputs or figures it generates. When doing calculations “by hand,” include the formulas you used. You do not have to show all your work, but it may make it easier to assign partial credit.
For this lab, turn everything in as a single .Rmd file by the start of next lab.
Note: if you didn’t start with the tutorial above, you’ll need to define the x and p variables before running the code below.
The uniform is used when there’s an equal probability of an event occurring over a range of values.
x = seq(-5,5,by=.1)
p = seq(0,1,by=0.01)
plot(x,dunif(x),type='l')
lines(x,dunif(x,0,4),col=2)
lines(x,dunif(x,-3,-0.5),col=3)
plot(x,punif(x),type='l')
lines(x,punif(x,0,4),col=2)
lines(x,punif(x,-3,-0.5),col=3)
plot(p,qunif(p),type='l',ylim=range(x))
lines(p,qunif(p,0,4),col=2)
lines(p,qunif(p,-3,-0.5),col=3)
hist(runif(500,-3,3),breaks=30,xlim=range(x),probability=TRUE)
lines(x,dunif(x,-3,3),col=2)
1. Why does the height of the uniform PDF change as the width changes?
When the width of the uniform increases, the total probability must still integrate to 1, with all values having equal probability of occurring. Spreading the same total probability over a wider range of values means the density at any one value must decrease, lowering the overall height of the PDF.
2. What do the second and third arguments to the uniform specify? What are their default values?
The second argument to the uniform (dunif()) is the minimum value, which has a default of 0, and the third argument is the maximum, which has a default of 1. This creates an equal probability for all values between zero and one.
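A quick check of those defaults (dunif's arguments are x, min, and max):

```r
# With the defaults min = 0, max = 1, the density is 1 inside [0,1] and 0 outside
dunif(0.5)        # 1
dunif(0.5, 0, 1)  # 1, identical to the default call
dunif(1.5)        # 0, outside the default range
```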
The Beta is an interesting distribution because it is bounded on the range 0 <= X <= 1 and thus is very often used to describe data that are proportions. At first glance the mathematical formula for the Beta looks a lot like the Binomial:
\[Beta(x \mid a,b) \propto x^{a-1} (1-x)^{b-1}\] \[Binom(x \mid n,p) \propto p^x (1-p)^{n-x}\]
The critical difference between the two is that in the Beta the random variable X is in the base while in the Binomial it is in the exponent. Unlike many distributions you may have used in the past, the two shape parameters in the Beta do not define the mean and variance, but these can be calculated as simple functions of \(\alpha\) and \(\beta\). The Beta does have an interesting property of symmetry though, whereby Beta(\(\alpha\), \(\beta\)) is the reflection of Beta(\(\beta\),\(\alpha\)).
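Both properties are easy to verify numerically (a sketch; the helper function names are mine, implementing the standard Beta mean and variance formulas):

```r
# Mean and variance as simple functions of the shape parameters
beta_mean <- function(a, b) a / (a + b)
beta_var  <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))
beta_mean(2, 5)   # 2/7, about 0.286

# Reflection symmetry: Beta(a,b) at x equals Beta(b,a) at 1-x
p <- seq(0, 1, by = 0.01)
all.equal(dbeta(p, 2, 5), dbeta(1 - p, 5, 2))  # TRUE
```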
p = seq(0,1,by=0.01)
plot(p,dbeta(p,5,5),type='l', xlim=c(0,1))
lines(p,dbeta(p,1,1),col=2)
lines(p,dbeta(p,0.2,0.2),col=3)
## vary beta
plot(p,dbeta(p,6,6),type='l',ylim=c(0,5))
lines(p,dbeta(p,6,4),col=2)
lines(p,dbeta(p,6,2),col=3)
lines(p,dbeta(p,6,1.25),col=4)
lines(p,dbeta(p,6,1),col=5)
lines(p,dbeta(p,6,0.25),col=6)
legend("topleft",legend=c(6,4,2,1.25,1,0.25),lty=1,col=1:6)
plot(p,pbeta(p,5,5),type='l')
lines(p,pbeta(p,1,1),col=2)
lines(p,pbeta(p,0.2,0.2),col=3)
hist(rbeta(500,3,3),breaks=30,xlim=range(p),probability=TRUE)
lines(p,dbeta(p,3,3),col=2)
Lab questions:
3) The Beta has a special case, Beta(1,1) that is equivalent to what other PDF?
The beta distribution with shape parameters both equal to one is equivalent to the Uniform distribution bounded at zero and one:
plot(p,dbeta(p,1,1),type='l', xlim=c(-1,2), col=1)
lines(p,dunif(p,0,1),col=2)
4) In the first panel, the mean is the same for each line (0.5). What are the variances? (Hint: Calculate this analytically. Look up the distribution in one of the recommended references.)
The shape parameters are (5,5), (1,1), and (0.2,0.2) in the first panel. According to the Wikipedia page for the Beta distribution, the shape parameters can be plugged into the following equation to calculate the variance:
\[Var[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\]
For α = β = 5: (5 × 5)/((5+5)^2 (5+5+1)) = 25/1100 = 0.023
For α = β = 1: (1 × 1)/((1+1)^2 (1+1+1)) = 1/12 = 0.083
For α = β = 0.2: (0.2 × 0.2)/((0.2+0.2)^2 (0.2+0.2+1)) = 0.04/0.224 = 0.179
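These hand calculations can be checked in R (the beta_var helper name is mine; the simulation check uses an arbitrary seed):

```r
# Variance formula for the Beta, checked against a simulated sample
beta_var <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))
beta_var(5, 5)        # 0.023
beta_var(1, 1)        # 0.083
beta_var(0.2, 0.2)    # 0.179
set.seed(42)
var(rbeta(1e5, 5, 5)) # sample variance, close to 0.023
```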
5) In the second panel, what are the means and medians of each of these curves? (Hint: you'll need to calculate the mean analytically and use one of the variants of R's beta function to find the median.)
The equation to find the mean (according to the Wikipedia page for Beta distribution) is as follows: α/(α + β)
To find the median, we can use the quantile function for the beta distribution, using 0.5 as the first argument: median of beta(p,α,β) = qbeta(0.5, α, β)
Curve 1: α = 6, β = 6
Curve 2: α = 6, β = 4
Curve 3: α = 6, β = 2
Curve 4: α = 6, β = 1.25
Curve 5: α = 6, β = 1
Curve 6: α = 6, β = 0.25
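Plugging the six parameter pairs in (qbeta is R's Beta quantile function):

```r
# Means (analytic, a/(a+b)) and medians (via qbeta) for the six curves
a <- 6
b <- c(6, 4, 2, 1.25, 1, 0.25)
means   <- a / (a + b)
medians <- qbeta(0.5, a, b)
round(rbind(means, medians), 3)
```

The means work out to 0.5, 0.6, 0.75, 0.83, 0.86, and 0.96; for the symmetric first curve the median equals the mean (0.5), while for the left-skewed curves the medians come out slightly above the means.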
The lognormal is the distribution of a random variable whose logarithm is normally distributed. It is defined on the range X > 0 and so is commonly used for data that cannot be negative by definition. The distribution is also positively skewed and so is often used for skewed data. One thing that often goes unappreciated with the log-normal is that the mean, E[X], depends on the variance:
\[E[X] = e^{\mu + {{1}\over{2}}\sigma^2}\]
This applies not just when you explicitly use the lognormal, but also whenever you log-transform data and then calculate a mean or standard deviation – a fact that is vastly under-appreciated in the biological and environmental sciences and frequently missed in the published literature. In fact, ANY data transformation applied to make data “more normal” will change the mean, with the functional form of the bias depending on the transformation used. You cannot simply back-transform the data without correcting for this bias. This phenomenon is another illustration of Jensen’s Inequality.
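A small simulation makes the bias concrete (a sketch with an arbitrary seed): back-transforming the mean of the logged data recovers the geometric mean, not the arithmetic mean E[X].

```r
# Back-transforming a log-scale mean underestimates the arithmetic mean
set.seed(1)
y <- rlnorm(1e5, meanlog = 0, sdlog = 1)
mean(y)            # close to exp(0 + 1/2), about 1.65
exp(mean(log(y)))  # close to exp(0) = 1 -- the geometric mean, biased low
```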
## changing the mean
x <- 10^seq(-2,2,by=0.01)
plot(x,dlnorm(x,0),type='l',xlim=c(0,15),main="Changing the Mean")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
legend("topright",legend=0:2,lty=1,col=1:3)
## on a log scale
plot(x,dlnorm(x,0),type='l',log='x',main="Log Scale")
lines(x,dlnorm(x,1),col=2)
lines(x,dlnorm(x,2),col=3)
abline(v=exp(0),col=1)
abline(v=exp(1),col=2)
abline(v=exp(2),col=3)
legend("topright",legend=0:2,lty=1,col=1:3)
## changing the variance
plot(x,dlnorm(x,2,.125),type='l',xlim=c(0,20),ylim=c(0,0.6),main="Changing the Variance")
lines(x,dlnorm(x,2,0.25),col=2)
lines(x,dlnorm(x,2,0.5),col=3)
lines(x,dlnorm(x,2,1),col=4)
lines(x,dlnorm(x,2,2),col=5)
lines(x,dlnorm(x,2,4),col=6)
abline(v=exp(2),col=1)
legend("topright",legend=c(0.125,0.25,0.5,1,2,4),lty=1,col=1:6)
## random sample
hist(rlnorm(250,2,1),breaks=30,probability=TRUE)
lines(x,dlnorm(x,2,1),col=4)
6) What are the arithmetic and geometric means of the three curves in the first panel? (Reminder: arithmetic means are means in the linear domain, geometric means are means in the log domain)
To find the arithmetic mean, we can use the equation provided above:
\[E[X] = e^{\mu + {{1}\over{2}}\sigma^2}\] where \(\mu\) is meanlog, the second argument to the dlnorm function, and \(\sigma\) is sdlog, which we are leaving at its default of 1. This simplifies the equation to: \[E[X] = e^{\mu + 1/2}\]
For line 1: \(\mu\) = 0, so \(E[X] = e^{0+1/2}\) = 1.65
For line 2: \(\mu\) = 1, so \(E[X] = e^{1+1/2}\) = 4.48
For line 3: \(\mu\) = 2, so \(E[X] = e^{2+1/2}\) = 12.18
To find the geometric mean, we can use the equation from the Log-normal Distribution Wikipedia page, which states that:
\(GM[X] = e^\mu\), which is also equal to the median.
For line 1: \(\mu\) = 0, so \(GM[X] = e^{0}\) = 1
For line 2: \(\mu\) = 1, so \(GM[X] = e^{1}\) = 2.72
For line 3: \(\mu\) = 2, so \(GM[X] = e^{2}\) = 7.39
We can verify using qlnorm(0.5, mu) that these values are indeed the median.
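The verification itself is one line per curve, since the median of the lognormal is exp(meanlog):

```r
# Medians via the quantile function: qlnorm(0.5, mu) = exp(mu)
qlnorm(0.5, 0)  # 1
qlnorm(0.5, 1)  # 2.718
qlnorm(0.5, 2)  # 7.389
```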
The exponential distribution arises naturally as the time it takes for an event to occur when the average rate of occurrence, r, is constant. The exponential is a special case of the Gamma (discussed next) where \(Exp(X \mid r) = Gamma(X \mid 1,r)\). The exponential is also a special case of the Weibull, \(Exp(X \mid r) = Weibull(X \mid r,1)\), where the Weibull is a generalization of the exponential that allows the rate parameter r to increase or decrease with time. The Laplace is basically a two-sided exponential and arises naturally if one is dealing with absolute deviation, \(|x-m|\), rather than squared deviation, \((x-m)^2\), as is done with the normal.
## changing the mean
x <- seq(0,10,by=0.01)
plot(x,dexp(x,0.125),type='l',ylim=c(0,1))
lines(x,dexp(x,0.25),col=2)
lines(x,dexp(x,0.5),col=3)
lines(x,dexp(x,1),col=4)
lines(x,dexp(x,2),col=5)
lines(x,dexp(x,4),col=6)
legend("topright",legend=c(0.125,0.25,0.5,1,2,4),lty=1,col=1:6)
## random sample
hist(rexp(250,2),breaks=30,probability=TRUE)
lines(x,dexp(x,2),col=4)
## laplace vs Gaussian
plot(x,dexp(abs(x-5),1)/2,type='l')
lines(x,dnorm(x,5),col=2)
plot(x,dexp(abs(x-5),1)/2,type='l',log='y') ## same plot as last but on a log scale
lines(x,dnorm(x,5),col=2)
7) The last two panels compare a normal and a Laplace distribution with the same mean and variance. How do the two compare? In particular, compare the difference in the probabilities of extreme events in each distribution.
The two distributions have the same mean and variance, but differ in the probabilities of both the most common and the most extreme events. The Laplace distribution, which is essentially two exponential distributions back-to-back, is sharply peaked at the mean. Although its density drops rapidly as we move away from the mean, it does not fall off as quickly as the Normal distribution's does, so its tails stretch out further, allowing for a higher probability of extreme values. For the normal distribution, the high-probability values are less concentrated (more apparent in the linear-scale plot), but the tails are lighter. The log-scale version of this comparison makes the difference in extreme values much clearer.
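To put numbers on the tails, here is a sketch comparing the two-sided tail probability P(|X − μ| > 3) when both distributions are matched to variance 1 (the Laplace then needs rate √2, since its variance is 2/rate²; note the panels above use rate 1):

```r
# Two-sided tail beyond 3 units from the mean, both distributions with variance 1
tail_laplace <- exp(-sqrt(2) * 3)  # Laplace: P(|X - mu| > t) = exp(-rate * t)
tail_normal  <- 2 * pnorm(-3)      # Normal: two-sided tail
c(laplace = tail_laplace, normal = tail_normal, ratio = tail_laplace / tail_normal)
```

Even with matched variances, the Laplace puts roughly five times more probability on these extreme events.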
The gamma and inverse-gamma are flexible distributions defined for positive real numbers. They are frequently used to model the distribution of variances or precisions (precision = 1/variance), in which case the shape and rate parameters are related to the sample size and sum of squares, respectively. The gamma is frequently used in Bayesian statistics as a prior distribution, and also in mixture distributions for inflating the variance of another distribution.
x <- seq(0,10,by=0.01)
## change rate
plot(x,dgamma(x,3,3),type='l' ,ylim=c(0,1.6))
lines(x,dgamma(x,3,1),col=2)
lines(x,dgamma(x,3,6),col=3)
lines(x,dgamma(x,3,1/3),col=4)
legend("topright",legend=c(3,1,6,0.33),lty=1,col=1:4)
## change shape
plot(x,dgamma(x,3,3),type='l')
lines(x,dgamma(x,1,3),col=2)
lines(x,dgamma(x,6,3),col=3)
lines(x,dgamma(x,18,3),col=4)
legend("topright",legend=c(3,1,6,18),lty=1,col=1:4)
## change variance
a <- c(20,15,10,5,2.5,1.25)
r <- c(4,3,2,1,0.5,0.25)
plot(x,dgamma(x,20,4),type='l')
for(i in 1:6){
lines(x,dgamma(x,a[i],r[i]),col=i)}
var = a/r^2
mean = a/r
legend("topright",legend=format(var,digits=3),lty=1,col=1:6)
var
## [1] 1.250000 1.666667 2.500000 5.000000 10.000000 20.000000
mean
## [1] 5 5 5 5 5 5
## change mean
var = 4
mean = c(1,2,3,4,5)
rate = mean/var
shape = mean^2/var
plot(x,dgamma(x,shape[2],rate[2]),type='l',col=2)
for(i in 1:5){
lines(x,dgamma(x,shape[i],rate[i]),col=i)
}
legend("topright",legend=1:5,lty=1,col=1:5)
rate
## [1] 0.25 0.50 0.75 1.00 1.25
shape
## [1] 0.25 1.00 2.25 4.00 6.25
8) Looking at the 'change variance' figure, how does the variance change as a and r increase? Qualitatively, how does this affect the mode and skew? Quantitatively, how does this affect the median (relative to the mean)?
In the 'change variance' figure, the mean (a/r) is held constant at 5 while a and r increase together. Because the rate parameter is squared in the denominator of the variance equation (var = a/r^2), the variance decreases as a and r increase. Qualitatively, as the variance shrinks the mode moves up toward the mean and the positive skew diminishes, so the curves become more symmetric. Quantitatively, the median of each of these right-skewed curves lies below the mean, and it approaches the mean from below as a and r increase.
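The medians can be computed directly with the gamma quantile function, reusing the a and r vectors from the 'change variance' chunk:

```r
# Medians of the six 'change variance' curves, all with mean a/r = 5
a <- c(20, 15, 10, 5, 2.5, 1.25)
r <- c(4, 3, 2, 1, 0.5, 0.25)
med <- qgamma(0.5, a, r)
round(med, 2)  # all below the mean of 5, approaching 5 as the variance shrinks
```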
The binomial arises naturally from counts of the number of successes given a probability of success, p, and a sample size, n. You are probably already familiar with the binomial in the context of coin toss examples.
x <- 0:11
## vary size of sample (number of draws)
size = c(1,5,10)
plot(x,dbinom(x,size[1],0.5),type='s')
for(i in 2:3){
lines(x,dbinom(x,size[i],0.5),type='s',col=i)
}
legend("topright",legend=size,lty=1,col=1:3)
## vary probability
n = 10
p = c(0.1,0.25,0.5)
plot(x,dbinom(x,n,p[1]),type='s')
for(i in 2:3){
lines(x,dbinom(x,n,p[i]),col=i,type='s')
}
abline(v = n*p,col=1:3,lty=2)
legend("topright",legend=p,lty=1,col=1:3)
## CDF
plot(x,pbinom(x,n,p[1]),type='s',ylim=c(0,1))
for(i in 2:3){
lines(x,pbinom(x,n,p[i]),col=i,type='s')
}
legend("bottomright",legend=p,lty=1,col=1:3)
## Random samples
hist(rbinom(100,10,0.5)+0.0001, ## small amount added because
probability=TRUE, ## of way R calculates breaks
breaks=0:11,main="Random Binomial")
lines(x,dbinom(x,10,0.5),type='s',col=2) #Analytical solution
legend("topright",legend=c("sample","pmf"),lty=1,col=1:2)
9) Consider a binomial distribution that has a constant mean, np. What are the differences in the shape of this distribution if it has a high n and low p vs. a low n and high p?
Our expected number of successes is the mean, np. Below, I have plotted two distributions with the same mean of 5; the red has an n of 100 and a p of 0.05, the black has an n of 10 and a p of 0.5. The high-n/low-p case (red line) is skewed to the right and has the larger variance: with the mean fixed, var = np(1-p) = 5(1-p), so a smaller p leaves a larger variance (4.75 vs 2.5 here). The low-n/high-p case (black line) is symmetric and more tightly clustered around the mean. In the limit of large n and small p, the binomial approaches a Poisson distribution with lambda = np.
x <- 0:20
# both have a mean of 5
plot(x,dbinom(x,10,.5),type='s') # high p: black line
lines(x,dbinom(x,100,.05),type='s', col=2) #high n: red line
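The variances behind this comparison follow directly from var = np(1-p):

```r
# With the mean fixed at np = 5, a smaller p leaves a larger variance
100 * 0.05 * 0.95  # n = 100, p = 0.05: variance 4.75
10  * 0.5  * 0.5   # n = 10,  p = 0.5:  variance 2.5
```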
The Poisson is also very common for count data and arises as the number of events that occur in a fixed amount of time (e.g. number of bird sightings per hour), or the number of items found in a fixed amount of space (e.g. the number of trees in a plot). Unlike the Binomial distribution the Poisson doesn’t have a fixed upper bound for the number of events that can occur.
x <- 0:12
plot(x,dpois(x,1),type='s')
lines(x,dpois(x,2),type='s',col=2)
lines(x,dpois(x,5),type='s',col=3)
legend("topright",legend=c(1,2,5),lty=1,col=1:3)
x <- 20:80
plot(x,dpois(x,50),type='s',ylim=c(0,0.08)) #Poisson with mean 50 (variance = 50)
lines(x+0.5,dnorm(x,50,sqrt(50)),col=2) #Normal with mean and variance of 50
lines(x,dbinom(x,100,0.5),col=3,type='s') #Binomial with mean 50 (variance = 25)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
plot(x,dbinom(x,100,0.5),type='s',col=3) #Binomial with mean 50 (variance = 25)
lines(x+0.5,dnorm(x,50,sqrt(25)),col=2) #Normal with mean 50 and variance of 25
lines(x,dpois(x,50),col=1,type='s') #Poisson with mean 50 (variance = 50)
legend("topright",legend=c("pois","norm","binom"),col=1:3,lty=1)
The last two panels compare the Poisson, Normal, and Binomial with the same mean and a large sample size. The Poisson and Binomial are identical in the two figures, but in the first the Normal has the same variance as the Poisson and in the second it has the same variance as the Binomial.
10) Normal distributions are often applied to count data even though count data can only take on positive integer values. Is it fair to do this in these two examples? (i.e. how good is the normal approximation)
Normal distributions may be a good enough description of count data if the mean and standard deviation are such that a negative value would be very unlikely. In the first example, with a mean (and variance) of one, the distribution is asymmetric because it is pressed up against zero. With a mean of 5, the Poisson gets much closer to a Normal distribution. In the second example, with a high mean, we see that the Poisson and Normal appear similar when the Normal's variance equals its mean, whereas a lower-variance Normal can approximate the Binomial.
11) Would the normal be a fair approximation to the Poisson curves for small numbers (the first panel)? How about for the Binomial for small numbers (earlier panel of figures on the Binomial)?
No, the Normal would not be a good approximation to the Poisson for small numbers. For small means, the Normal places substantial probability on negative values, which are impossible for counts. The same problem applies to the Binomial panels with small n and p. To approximate the Poisson or Binomial in these cases, we would need to truncate or censor the negative part of the Normal distribution.
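A quick way to quantify this problem is the probability mass a moment-matched Normal puts below zero for a few Poisson means:

```r
# Fraction of a Normal(mean = lambda, sd = sqrt(lambda)) that falls below zero
lambda <- c(1, 2, 5, 50)
round(pnorm(0, mean = lambda, sd = sqrt(lambda)), 4)
```

For a mean of 1 about 16% of the Normal lies below zero, while for a mean of 50 the negative tail is negligible.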
12) Is the Poisson a good approximation of the Binomial?
If there is a low p (probability of success) and a high n (sample size), then the Poisson is a good approximation of the Binomial, even though the Poisson has no upper bound on the number of events.
13) Is it possible to choose the parameters so that the Poisson and Binomial to both have the same mean and variance? If so what is this parameterization?
Only approximately. If n is very large and p is close to zero, then the Poisson closely matches the mean and variance of the Binomial, but the match can never be exact for p > 0, since the Binomial variance np(1-p) is always slightly less than its mean np. Multiplying n and p gives the lambda value for the Poisson distribution:
x <- 0:30
plot(x, pbinom(x, 1000, .002), type='s')
lines(x, ppois(x, lambda = 2), type='s', col=2)
x <- 0:30
plot(x, pbinom(x, 100, .03), type='s')
lines(x, ppois(x, lambda = 3), type='s', col=2)
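The near-match in moments for the first comparison can be checked directly:

```r
# Binomial(1000, 0.002) vs Poisson(lambda = 2): means match, variances nearly do
n <- 1000; p <- 0.002
c(binom_mean = n * p, binom_var = n * p * (1 - p), pois_var = n * p)
```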
The negative binomial has two interesting interpretations that are subtly different from either the Poisson or Binomial. In the first case, it is the number of trials needed in order to observe a fixed number of occurrences, the opposite of the Binomial’s number of occurrences in a fixed number of trials, which is where it gets its name. The Negative Binomial also arises as the distribution of the number of events that occur in a fixed space or time when the rate is not constant (as in the Poisson) but varies according to a Gamma distribution. Hence the Negative Binomial is also used to describe data that logically seem to come from a Poisson process but have greater variability than is expected from the Poisson (which by definition has a variance equal to its mean). The Geometric distribution arises as a special case of the negative binomial where the number of occurrences is fixed at 1.
x <- 0:20
## negative binomial
## vary size
plot(x,dnbinom(x,1,0.5),type="s",main="vary size")
lines(x,dnbinom(x,2,0.5),type="s",col=2)
lines(x,dnbinom(x,3,0.5),type="s",col=3)
lines(x,dnbinom(x,5,0.5),type="s",col=4)
lines(x,dnbinom(x,10,0.5),type="s",col=5)
legend("topright",legend=c(1,2,3,5,10),col=1:5,lty=1)
## vary probability
plot(x,dnbinom(x,3,0.5),type="s",main="vary probability")
lines(x,dnbinom(x,3,0.3),type="s",col=2)
lines(x,dnbinom(x,3,0.2),type="s",col=3)
lines(x,dnbinom(x,3,0.1),type="s",col=4)
legend("topright",legend=c(0.5,0.3,0.2,0.1),col=1:4,lty=1)
## vary variance , alternate parameterization
mean = 8
var = c(10,20,30)
size = mean^2/(var-mean)
plot(x,dnbinom(x,mu=mean,size=size[1]),type="s",ylim=c(0,0.14),main="vary variance")
lines(x,dnbinom(x,mu=mean,size=size[2]),type="s",col=2)
lines(x,dnbinom(x,mu=mean,size=size[3]),type="s",col=3)
legend('topright',legend=format(c(var,mean),digits=2),col=1:4,lty=1)
lines(x,dpois(x,mean),col=4,type="s")
## NB as generalization of pois with inflated variance
## geometric
plot(x,dgeom(x,0.5),type="s",main="Geometric")
lines(x,dgeom(x,0.15),type="s",col=2)
lines(x,dgeom(x,0.05),type="s",col=3)
lines(x,dnbinom(x,1,0.15),type="s",col=4,lty=2)
## geometric as special case of NB where size = 1
14) In the 'vary size' panel, what are the means of the curves?
According to the R help page for NegBinomial, “the mean is μ = n(1-p)/p.” With p held constant at 0.5 for all curves, (1-p)/p = 1, so the mean equals n. As n takes the values 1, 2, 3, 5, and 10, the respective means are also 1, 2, 3, 5, and 10.
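This can be confirmed numerically with a weighted sum over the pmf (summing to 1000 captures essentially all of the mass):

```r
# Numerical check of the Negative Binomial means for p = 0.5
xx <- 0:1000
sizes <- c(1, 2, 3, 5, 10)
sapply(sizes, function(k) sum(xx * dnbinom(xx, k, 0.5)))  # 1 2 3 5 10
```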
15) In the “vary variance” panel, how does the shape of the Negative Binomial compare to the Poisson?
In the ‘vary variance’ panel, the mean of the three Negative Binomial curves and the one Poisson curve stays constant (mean = 8), but the variance of the Negative Binomial increases through values of 10, 20, and 30. Because all four are bounded at zero, the curves become increasingly asymmetric, and the right tails of the Negative Binomial are heavier than the Poisson’s due to the higher variance. Below, we can see that if the mean is further from zero, the change in skew becomes less pronounced.
## vary variance , alternate parameterization
mean = 15
var = c(17,25,35)
size = mean^2/(var-mean)
plot(x,dnbinom(x,mu=mean,size=size[1]),type="s",ylim=c(0,0.14),main="vary variance")
lines(x,dnbinom(x,mu=mean,size=size[2]),type="s",col=2)
lines(x,dnbinom(x,mu=mean,size=size[3]),type="s",col=3)
legend('topright',legend=format(c(var,mean),digits=2),col=1:4,lty=1)
lines(x,dpois(x,mean),col=4,type="s")